Abstract

This paper proposes a novel framework for generating lingual
descriptions of indoor scenes. Whereas substantial efforts have been
made to tackle this problem, previous approaches focus primarily on
generating a single sentence for each image, which is not sufficient
for describing complex scenes. We attempt to go beyond this by
generating coherent descriptions with multiple sentences. Our approach
is distinguished from conventional ones in several aspects: (1) a 3D
visual parsing system that jointly infers objects, attributes, and
relations; (2) a generative grammar learned automatically from
training text; and (3) a text generation algorithm that takes into
account the coherence among sentences. Experiments on the augmented
NYU-v2 dataset show that our framework can generate natural
descriptions with substantially higher ROUGE scores compared to those
produced by the baseline.

1 Introduction 

Image understanding has been the central goal of computer vision.
Whereas a majority of work on image understanding focuses on
class-based annotation, we believe that describing an image using
natural language is still the best way to show one’s understanding.
The task of automatically generating textual descriptions for images
has received increasing attention from both the computer vision and
natural language processing communities. This is an important problem,
as an effective solution can enable many exciting real-world
applications, such as human-robot interaction, image/video synopsis,
and automatic caption generation.

While this task has been explored in previous work, existing methods
mostly rely on pre-defined templates (Barbu et al., 2012;
Krishnamoorthy et al., 2013), which often result in tedious
descriptions. Another line of work solves the description generation
problem via retrieval, where a description for an image is borrowed
from the semantically most similar image in the training set (Ordonez
et al., 2011; Farhadi et al., 2010). This setting is, however, less
applicable to complex scenes composed of a large set of objects in
diverse configurations, such as indoor environments.

Figure 1: Our method visually parses an RGB-D image to get a scene
graph that represents objects, their attributes and relations between
objects, and generates a multi-sentence description via a learned
grammar.

Recently, the field has witnessed a boom in generating image
descriptions via deep neural networks (Kiros et al., 2014; Karpathy
and Fei-Fei, 2014; Chen and Zitnick, 2014), which are able both to
learn a weak language model and to generalize descriptions to unseen
images. These approaches typically represent the image and
words/sentences with vectors and reason in a joint embedding space.
The results have been impressive, perhaps partly due to powerful
representations on the image side (Krizhevsky et al., 2012). This line
of work mainly generates a single sentence for each image, which
typically focuses on one or two objects and contains very few
prepositional relations between objects.

In this paper, we are interested in generating multi-sentence
descriptions of cluttered indoor scenes, which is particularly
relevant for indoor robotics. Complex, multi-sentence output requires
us to deal with challenging problems such as consistent co-referrals
to visual entities across sentences. Furthermore, the sequence of
sentences needs to be as natural as possible, mimicking how humans
describe the scene. This is particularly important, for example, in
the context of social robotics to enable realistic communication.

Towards this goal, we develop a framework with three major
components: (1) a holistic visual parser that couples the inference of
objects, attributes, and relations to produce a semantic
representation of a 3D scene (Fig. 1); (2) a generative grammar
automatically learned from training text; and (3) a text generation
algorithm that takes into account subtle dependencies across
sentences, such as logical order, diversity, saliency of objects, and
co-references.

To test the effectiveness of our approach, we construct an augmented
dataset based on NYU-RGBD (Silberman et al., 2012), where each scene
is associated with up to 5 natural language descriptions from human
annotators. This allows us to learn a language model to describe
images the way that humans do. Experiments show that our method
produces natural descriptions, significantly improving the F-measures
of ROUGE scores over the baseline.

2 Related Work 

A large body of existing work deals with images and text in one form
or another. The dominant subfield exploits text in the form of tags or
short sentences as weak labels to learn visual models (Quattoni et
al., 2007; Li et al., 2009; Socher and Fei-Fei, 2010; Gupta and Davis,
2008), as well as attributes (Matuszek et al., 2012; Silberer et al.,
2013). This type of approach has also been explored in videos to learn
visual action models from textual summaries of videos (Ramanathan et
al., 2013), or to learn visual concepts from videos described with
short sentences (Yu and Siskind, 2013). Another direction is to
exploit short sentences associated with images in order to improve
visual recognition tasks (Fidler et al., 2013; Kong et al., 2014).
Just recently, an interesting problem domain was introduced in
(Malinowski and Fritz, 2014) with the aim of learning how to answer
questions about images from Q&A examples. In (Lin et al., 2014), the
authors address visual search with complex natural lingual queries.

There has been substantial work in automatically generating a caption
or description for a given image. The most popular approach has been
to retrieve a sentence from a large corpus based on similarity of
visual content (Ordonez et al., 2011; Farhadi et al., 2010; Kuznetsova
et al., 2012; Rohrbach et al., 2013; Yang et al., 2011). This line of
work bypasses having to deal with language template specification or
template learning. However, such approaches typically adopt a limited
image representation such as action-object-scene triplets (Farhadi et
al., 2010). This makes for a restrictive setting, as neither the image
representation nor the retrieved sentence can faithfully model a truly
complex scene. In (Kuznetsova et al., 2014) the authors go further by
only learning phrases from related images.

Parallel to our work, there has been a recent boom in image
description generation with deep networks (Kiros et al., 2014;
Karpathy and Fei-Fei, 2014; Vinyals et al., 2014; Mao et al., 2014;
Donahue et al., 2014; Fang et al., 2014; Chen and Zitnick, 2014).
These methods transform the image as well as a sentence into a vector
representation and learn a joint embedding between the two modalities.
The output of these approaches is typically a short sentence for each
image. In contrast, our goal here is to generate multiple dependent
sentences that describe the salient objects in the scene, their
properties and spatial relations.

Generating descriptions has also been explored in the video domain.
(Barbu et al., 2012; Krishnamoorthy et al., 2013) output a video
description in the form of subject-action-object. In (Das et al.,
2013), “concept detectors” are formed, which are detectors for a
combined object and action or scene in a particular chunk of a video.
Via lingual templates, the concept detectors of particular types then
produce cohesive video descriptions. Due to a limited set of concepts
and templates, the final descriptions do not seem very natural.
(Rohrbach et al., 2013) predicts semantic representations from
low-level video features and uses machine translation techniques to
generate a sentence.

The closest to our work are (Kulkarni et al., 2011; Mitchell et al.,
2012; Kuznetsova et al., 2014), which, like ours, are able to describe
objects, their modifiers, and prepositions between objects. However,
our paper differs from (Kulkarni et al., 2011; Mitchell et al., 2012)
in several important ways. In our work, we reason in 3D as opposed to
2D, giving us more natural physical interpretations. We aim to
describe rich indoor scenes that contain many objects of various
classes and appear in various arrangements. In such a setting,
describing every detectable object and all relations between them as
in (Kulkarni et al., 2011) would generate prohibitively long, complex
and unnatural descriptions. Our model tries to mimic what and how
people describe such complex 3D scenes, thus taking into account
visual saliency at the level of objects, attributes and relations, as
well as the ordering and coherence of sentences. Another important
aspect that sets us apart from most past work is that instead of using
a few hand-crafted templates, we learn the grammar from training text.

Figure 2: The overall framework for description generation. The task
consists of a training and a testing phase. In training, the vision
models and the generative grammar are respectively learned from a set
of RGB-D images and their descriptions. In testing, given a new image,
the framework constructs a scene graph taking into account objects,
their attributes and relationships between objects, and transforms it
into a series of semantic trees. The learned grammar then generates
textual descriptions for these trees.

3 Framework Overview 

Our framework for generating descriptions for indoor scenes is based
on a key rationale: images and their corresponding descriptions are
two different ways to express the underlying common semantics shared
by both. As shown in Figure 2, given an image, the framework first
recovers the underlying semantics through holistic visual analysis
(Lin et al., 2013), which results in a scene graph that captures
detected objects and the spatial relations between them (e.g.
on-top-of and near).

The semantics embodied by a visual scene usually has multiple aspects.
When describing such a complex scene, humans often use a paragraph
comprised of multiple sentences, each focusing on a specific aspect.
To imitate this behavior, our framework transforms the scene graph
into a sequence of semantic trees, and yields multiple sentences, each
from a semantic tree. To make the results as natural as possible, we
adopt two strategies: (1) Instead of prescribing templates in advance,
we learn the grammar from a training set, namely a set of RGB-D scenes
with descriptions provided by humans. (2) We take into account
dependencies among sentences, including logical order, saliency,
coreference and diversity.

4 From RGB-D Images to Semantics 

Given an RGB-D image, we extract semantics through holistic visual
parsing. Particularly, we first parse the image to obtain the objects
of interest, their attributes, and their physical relations, and then
construct a scene graph, which provides a coherent summary of these
aspects.

4.1 Holistic Visual Parsing 

To parse the visual scene we use a recently proposed approach for 3D
object detection in RGB-D data (Lin et al., 2013). We briefly
summarize this approach here. First, a set of “objectness” regions is
generated following (Carreira and Sminchisescu, 2012); these regions
are encouraged to respect intensity as well as occlusion boundaries in
3D. The regions are projected to 3D via depth, and cuboids are then
fit tightly around them, under the constraint that they are parallel
to the ground floor.

A holistic CRF model is then constructed to jointly reason about the
classes of the cuboids as well as the class of the scene (e.g.,
kitchen, bathroom). The CRF thus has a random variable for each cuboid
representing its class, and a variable for the scene. To make it
possible to remove a bad, non-object cuboid, each cuboid has an
additional background state. The model exploits various geometric and
semantic relations by incorporating them into the CRF formulation as
potentials, which are summarized below:

Scene appearance. To incorporate global information, a unary
potential over the scene label is computed by means of a logistic on
top of the scene classification score (Xiao et al., 2010).

Cuboid class potential. Appearance-based classifiers, including
CPMC-o2 (Carreira et al., 2012) and superpixel scores (Ren et al.,
2012), are used to classify cuboids into a pre-defined set of object
classes. In this paper, we additionally use CNN (Krizhevsky et al.,
2012) features for classification. The classification scores for each
cuboid are used as different unary potentials in the CRF.

Object geometry. Cuboids are also classified based on geometric
features (e.g. height, longer width, aspect ratio) with an SVM, and
the classification scores are used as another unary potential.

Semantic context. Two co-occurrence relationships are used:
scene-object and object-object. The potential values are estimated
from the training set by counting the co-occurrence frequencies.

Geometric context. Two potentials are used to exploit the spatial
relations between cuboids in 3D, encoding close-to and on-top-of
relations. The potentials are defined to be the empirical
co-occurrence frequencies for each type of relation.

The CRF weights to combine the potentials are learned with a
primal-dual learning framework (Hazan and Urtasun, 2010), and
inference of class labels is done with an approximate algorithm
(Schwing et al., 2011).
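
The co-occurrence potentials described above are estimated by counting. The following is a minimal sketch of such counting, not the implementation used in this work: the function name, the input layout (one scene label plus a list of object labels per training scene), and the normalization by the number of scenes are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequencies(scenes):
    """Estimate scene-object and object-object co-occurrence frequencies.

    `scenes` is a list of (scene_label, [object_labels]) pairs.
    Frequencies are normalized by the number of scenes (an assumption).
    """
    scene_obj = Counter()
    obj_obj = Counter()
    for scene_label, objects in scenes:
        present = sorted(set(objects))  # count each class once per scene
        for obj in present:
            scene_obj[(scene_label, obj)] += 1
        for a, b in combinations(present, 2):  # unordered, sorted pairs
            obj_obj[(a, b)] += 1
    n = len(scenes)
    return (
        {key: count / n for key, count in scene_obj.items()},
        {key: count / n for key, count in obj_obj.items()},
    )
```

In a CRF these frequencies would typically enter as (possibly log-transformed) potential values; that step is left out here because the text does not specify it.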

4.2 Scene Graphs 

Based on the extracted visual information, we construct a scene graph
that captures objects, their attributes, such as color and size, and
the relations between them. In particular, a scene graph uses nodes to
represent objects and their attributes, and edges to represent
relations between nodes. Here, we consider three kinds of edges:
attribute edges that link objects to their attributes, position edges
that represent the positions of objects relative to the scene (e.g.
corner-of-room), and pairwise edges that characterize the relative
positions between objects (e.g. on-top-of and next-to).
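
One way to picture this structure is as a small container with one list per edge kind. The sketch below is illustrative only: the class name `SceneGraph`, its field names, and the string identifiers are hypothetical and are not taken from the system described here.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Objects plus the three kinds of edges described in the text."""
    scene_class: str
    objects: dict = field(default_factory=dict)          # object id -> class label
    attribute_edges: list = field(default_factory=list)  # (object id, kind, value)
    position_edges: list = field(default_factory=list)   # (object id, position)
    pairwise_edges: list = field(default_factory=list)   # (source, relation, target)

    def add_object(self, obj_id, label, **attributes):
        self.objects[obj_id] = label
        for kind, value in attributes.items():
            self.attribute_edges.append((obj_id, kind, value))

    def add_position(self, obj_id, position):
        self.position_edges.append((obj_id, position))

    def add_relation(self, source, relation, target):
        self.pairwise_edges.append((source, relation, target))

    def relations_from(self, source):
        """Return (relation, target) pairs whose source is `source`."""
        return [(r, t) for s, r, t in self.pairwise_edges if s == source]


graph = SceneGraph("kitchen")
graph.add_object("table-1", "table", color="brown", size="large")
graph.add_object("microwave-1", "microwave")
graph.add_position("table-1", "corner-of-room")
graph.add_relation("microwave-1", "on-top-of", "table-1")
```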

Given an image, a set of objects (with class labels) and the scene
class are obtained through visual parsing as explained in the previous
section. However, to form a scene graph, we still need further
analysis to extract attributes and relations. For each object we also
compute saliency, i.e. how likely an object is to be described. We
next describe how we obtain such information.

Object attributes: For each object, we use RGB histograms and C-SIFT,
and cluster them to obtain a visual word representation. We train
classifiers for the nine colors that are most mentioned in the
training set, as well as two material properties (wooden and bright).
We also train classifiers for four different sizes (wide, tall, large,
and small) using geometric features. To encode the correlations
between size and the object class, we augment the feature with a class
indicator vector.

Object saliency: The dataset of (Kong et al., 2014) contains
alignments between the nouns in a sentence and the visual objects in
the scene. We make use of this information to train a classifier
predicting whether an object in the scene is likely to be mentioned in
text. We train an SVM classifier using class-based features
(classification scores for each cuboid), geometric relations (volume,
distance to camera), and color features.

Object relations: We consider six types of object positions
(corner-of-room, front-of-camera, far-away-from-camera,
center-of-room, left-of-room, and right-of-room), and eight types of
pairwise relations (next-to, near, top-of, above, in-front-of, behind,
to-left-of, and to-right-of). We manually specify a few rules that
help us decide whether a specific relation is present or not.¹

5 Generating Lingual Descriptions 

Given a scene graph, our framework generates a descriptive paragraph
in two steps. First, it transforms the scene graph into a sequence of
semantic trees, each focusing on a certain semantic aspect. Then, it
produces sentences, one from each semantic tree, following a
generative grammar.

5.1 Semantic Trees 

A semantic tree captures information such as what entities are being
described and what the relationships between them are. Specifically, a
semantic tree contains a set of terminal nodes corresponding to
individual entities or their attributes, and relational nodes that
express relations among them.

¹ We tried obtaining ground-truth for relations via MTurk (which would
allow us to train classifiers instead); however, the results of all
batches were extremely noisy.



Consider a sentence “A red box is on top of a table”. The
corresponding semantic tree can be expressed as

    on-top-of(indet(color(box, red)), indet(table))

This tree has three terminals: “box”, “table”, and “red”. The relation
node “color(box, red)” describes the relation between “box” and “red”,
namely, “red” specifying the color of the “box”. The relation “indet”
qualifies the cardinality of its child, while “on-top-of”
characterizes the spatial relation between its children.
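
As an illustration, such a tree can be held in a nested structure. The sketch below assumes a representation chosen purely for exposition (a terminal is a string, a relation node is a tuple whose first element is the relation name); the helper names `terminals` and `to_string` are hypothetical.

```python
# The tree for "A red box is on top of a table".
TREE = ("on-top-of",
        ("indet", ("color", "box", "red")),
        ("indet", "table"))


def terminals(tree):
    """Collect terminal nodes in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    _relation, *children = tree
    return [t for child in children for t in terminals(child)]


def to_string(tree):
    """Render the tree in the functional notation used in the text."""
    if isinstance(tree, str):
        return tree
    relation, *children = tree
    return f"{relation}({', '.join(to_string(c) for c in children)})"
```

Calling `to_string(TREE)` reproduces the expression shown above, and `terminals(TREE)` lists the three terminals.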

5.2 Dependencies among Sentences 

In human descriptions, sentences are put together in a way that makes
the resultant paragraphs coherent. In particular, the dependencies
among sentences, as outlined below, play a crucial role in preserving
the coherence of a descriptive paragraph:

Logical order. When describing a scene, people present things in a
certain order. The leading sentence often mentions the type of the
entire scene and one of the most salient objects, e.g. “There is a
table in the dining room.”

Diversity. People generally avoid using the same prepositional
relation in multiple sentences. Also, when an object is mentioned in
multiple sentences, it usually plays a different role, e.g. “There is
a table near the wall. On top of the table is a microwave oven.” Here,
“table” serves as a source in the first sentence and as a target in
the second.²

Saliency. Saliency influences the order of sentences. The statistics
in (Kong et al., 2014) show that bigger objects are often mentioned
earlier in a description and co-referred across sentences, e.g. one
would say “This room has a dining table with a mug on top. Next to the
table is a chair.” and not “There is a mug on a table. Next to the mug
is a chair.” Saliency also depends on context, e.g. for bathrooms,
toilets are often mentioned.

Co-reference. When an object is mentioned for 
the second time following its debut, a pronoun is 
often used to make the sentence concise. 

Richness vs. Conciseness. When talking about 
an object for the first time, describing its color/size 
makes the sentence interesting and informative. 
However, this is generally unnecessary the next 
time the object is mentioned. 

² Each relation is considered as an edge. For example, in the phrases
“A on-top-of B” and “A near B”, “A” is considered as the source, while
“B” is considered as the target.


5.3 From Scene Graphs to Semantic Trees 

Motivated by these considerations, we devise a 
method below that transforms a scene graph into 
a sequence of semantic trees, each for a sentence. 

First of all, we initialize $w_i^s = w_i^t = s_i \cdot c_i$. Here,
$w_i^s$ and $w_i^t$ are the weights that respectively control how
likely the i-th object will be chosen as a source or a target in the
next sentence; $s_i$ is a positive value measuring the saliency of the
i-th object, while $c_i$ is given by the classifier to indicate its
confidence as to whether it makes a correct prediction of the object’s
class. These weights are updated as the generation proceeds.

To generate the leading sentence, we first draw a source i with a
probability proportional to $w_i^s$, and create a semantic tree by
choosing a relation, say “in”, which would lead to a sentence like
“There is a table in the dining room.” Once the i-th object is chosen
to be a source, $w_i^s$ will be set to 0, precluding it from being
chosen as a source again. However, $w_i^t$ remains unchanged, as it
remains fine for the object to serve as a target later.

For each subsequent sentence, we draw a source i, a target j, and a
relation r between i and j, with probability proportional to
$w_i^s w_j^t p_r$, where $p_r$ is the prior weight of the relation r.
At each iteration, one may also choose to terminate without generating
a new sentence, with a probability proportional to a positive value
$\tau$. These choices together result in a semantic tree in the form
of “r(make_tree(i), make_tree(j))”. Here, “make_tree(i)” creates a
sub-tree describing the object i, which may be
“indet(color(table, black))” when the color is known.

After the generation of this semantic tree, the weights $w_i^s$,
$w_j^t$, and $p_r$ will be set to zero to prevent the objects i and j
from being used again for the same role, and the relation r from being
chosen next time. Our algorithm also takes care of co-references: if
an object is selected again in the next sentence, it will be replaced
by a pronoun.
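
The sampling procedure of this section can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' code: the function and argument names are hypothetical, the termination option is normalized together with the candidate weights, the leading tree uses a `det` node for the scene, and pronoun substitution for co-references is omitted.

```python
import random


def make_tree(obj):
    """Sub-tree for one object, e.g. indet(color(table, black))."""
    node = obj["label"]
    if "color" in obj:
        node = ("color", node, obj["color"])
    return ("indet", node)


def generate_trees(objects, scene, relations, prior, tau, rng):
    """objects: id -> {"label", "saliency", "confidence", optional "color"}
    relations: (source id, relation, target id) triples from the scene graph
    prior: relation -> prior weight; tau: termination weight (> 0)."""
    w_src = {i: o["saliency"] * o["confidence"] for i, o in objects.items()}
    w_tgt = dict(w_src)
    prior = dict(prior)  # local copy, entries are zeroed as relations are used

    # Leading sentence: draw a source proportionally to its source weight.
    ids = list(objects)
    lead = rng.choices(ids, weights=[w_src[i] for i in ids])[0]
    trees = [("in", make_tree(objects[lead]), ("det", scene))]
    w_src[lead] = 0.0  # may still serve as a target later

    candidates = sorted(relations)
    while True:
        weights = [w_src[i] * w_tgt[j] * prior.get(r, 0.0)
                   for i, r, j in candidates]
        if sum(weights) <= 0:
            break  # nothing left to say
        # None stands for "terminate", drawn with weight tau.
        choice = rng.choices(candidates + [None], weights=weights + [tau])[0]
        if choice is None:
            break
        i, r, j = choice
        trees.append((r, make_tree(objects[i]), make_tree(objects[j])))
        w_src[i], w_tgt[j], prior[r] = 0.0, 0.0, 0.0
    return trees
```

Because each generated sentence zeroes one relation prior, the loop ends after at most as many iterations as there are relation types.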

5.4 Grammar and Derivation 

Given a semantic tree, our framework produces a sentence following a
generative grammar, namely, a map from each semantic relation to a set
of templates (i.e. derivation rules), as illustrated below:

    indet     -> a {1}
    color     -> {2} {1}
    on-top-of -> {1} is on top of {2}
              -> On top of {2} is {1}
              -> There is {1} on top of {2}

Figure 3: The process to derive templates by matching semantic nodes
to parts of the sentence. Starting from the root node, the learning
algorithm identifies the ranges of words corresponding to the child
nodes, and replaces them with a placeholder to obtain a template. This
proceeds downward recursively until all relation nodes are processed.

Each template has a weight that is set to its frequency in the
training set. The generation of a sentence from a semantic tree
proceeds from the root, and downward recursively to the terminals. For
each relation node, a template will be chosen, with a probability
proportional to the associated weight. Below is an example showing how
a sentence is derived following the grammar above.

       {on-top-of(indet(color(box, red)), indet(table))}
    => {indet(color(box, red))} is on top of {indet(table)}
    => a {color(box, red)} is on top of a table
    => a red box is on top of a table

As the choices of templates for relational nodes 
are randomized, different sentences can be derived 
for the same tree, with different probabilities. 
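
A minimal sketch of this randomized derivation is given below. It assumes the nested-tuple tree representation (relation name first, children after) and uses made-up template weights; in the framework the weights would be the template frequencies observed in the training set. The names `GRAMMAR` and `derive` are hypothetical.

```python
import random
import re

# relation -> list of (template, weight); the weights here are illustrative.
GRAMMAR = {
    "indet": [("a {1}", 1.0)],
    "color": [("{2} {1}", 1.0)],
    "on-top-of": [
        ("{1} is on top of {2}", 3.0),
        ("On top of {2} is {1}", 1.0),
        ("There is {1} on top of {2}", 1.0),
    ],
}


def derive(tree, grammar, rng):
    """Derive a sentence from a semantic tree, root first, recursively."""
    if isinstance(tree, str):  # terminal node
        return tree
    relation, *children = tree
    templates, weights = zip(*grammar[relation])
    template = rng.choices(templates, weights=weights)[0]
    parts = [derive(child, grammar, rng) for child in children]
    # Fill every {k} placeholder with the text derived for the k-th child.
    return re.sub(r"\{(\d+)\}", lambda m: parts[int(m.group(1)) - 1], template)


tree = ("on-top-of", ("indet", ("color", "box", "red")), ("indet", "table"))
sentence = derive(tree, GRAMMAR, random.Random(0))
```

Repeated calls with different random states yield the alternative phrasings, each with probability proportional to its template weight.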


5.5 Learning the Grammar

The grammar for generating sentences is often specified manually in
previous work (Barbu et al., 2012; Das et al., 2013). This, however,
is time consuming, unreliable, and tends to oversimplify the language.
In this work, we explore a new approach, that is, to learn the grammar
from data. The basic idea is to construct a semantic tree from each
sentence through linguistic parsing, and then derive the templates by
matching nodes of the semantic tree to parts of the sentence.

First, we use the Stanford parser (Toutanova et al., 2003) to obtain a
parse tree for each sentence, which is then simplified through a
series of filtering operations. For example, we merge noun phrases
(e.g. “fire extinguisher”) into a single node and compress common
prepositional phrases (e.g. “in the left of”) into a single link.

A semantic tree can then be derived by recursively translating the
simplified trees. This is straightforward. For example, a noun “box”
with an adjective “red” will be translated into “color(box, red)”; a
noun with a definite or indefinite article will be translated into a
det or indet relation node, respectively; two nouns or noun phrases
“A” and “B” linked by a prepositional link “above” will be translated
into “above(A, B)”.

With a sentence and a semantic tree constructed thereon, we can derive
the template through recursive matching, where matched children are
replaced by a placeholder, while other words are preserved literally
in the template. Figure 3 illustrates this procedure. We collect
templates respectively for each relation, and set the weight of each
template to its frequency. Empirically, we observed a long-tailed
distribution: a small number of common templates occur many times,
while a dominant portion of templates are used sporadically. To
improve the reliability, we discard all the templates that occur fewer
than 5 times and all relations whose total weight is less than 20.

6 Experimental Evaluation

We test the proposed framework on the NYU-v2 dataset (Silberman et
al., 2012) augmented with an additional set of textual descriptions,
one for each image. Particularly, we focus on assessing both the
relevance and quality of the generated descriptions.


6.1 Data Preparation 

The NYU-v2 dataset has 1449 RGB-D images of indoor scenes (e.g.
dining rooms, kitchens, offices). These images are divided into a
training and a testing set, following the partition used in (Lin et
al., 2013). The training set contains 795 scenes, while the testing
set contains the remaining 654. We use the descriptions from (Kong et
al., 2014), which were collected by asking MTurkers to describe the
image to someone who does not see it in order to provide him/her with
a vivid impression of the scene. The number of sentences per
description ranges from 1 to 10 with an average of 3. There are on
average 40 words in a description.

We learn the generative grammar using the algorithm described in
Section 5.5 from the training set of descriptions. We also train the
CRF for visual analysis and apply it to detect objects and predict
their attributes and relations, following the procedure described in
Section 4.1. These models are then used to produce textual
descriptions for each test scene.

config      ROUGE-1                   ROUGE-2                   ROUGE-SU4
            R       P       F         R       P       F         R       P       F
baseline    0.3000  0.2947  0.2968    0.0667  0.0657  0.0661    0.1026  0.1006  0.1014
GT    L0    0.3332  0.3249  0.3281    0.0786  0.0765  0.0773    0.1372  0.1334  0.1348
GT    L1    0.3378  0.3294  0.3327    0.0838  0.0816  0.0824    0.1397  0.1359  0.1373
GT    L2    0.3392  0.3308  0.3340    0.0849  0.0827  0.0835    0.1409  0.1370  0.1385
GT    L3    0.3770  0.3676  0.3712    0.1092  0.1067  0.1076    0.1629  0.1584  0.1601
GT    L4    0.3775  0.3680  0.3716    0.1064  0.1040  0.1049    0.1598  0.1554  0.1570
GT    L5    0.3755  0.3658  0.3695    0.1008  0.0984  0.0993    0.1563  0.1519  0.1536
Real  L0    0.3243  0.3161  0.3192    0.0752  0.0735  0.0742    0.1306  0.1270  0.1283
Real  L1    0.3347  0.3266  0.3296    0.0814  0.0795  0.0802    0.1362  0.1325  0.1338
Real  L2    0.3338  0.3256  0.3286    0.0816  0.0796  0.0803    0.1356  0.1319  0.1332
Real  L3    0.3641  0.3541  0.3580    0.1045  0.1019  0.1029    0.1546  0.1499  0.1517
Real  L4    0.3663  0.3560  0.3600    0.1039  0.1011  0.1022    0.1534  0.1486  0.1504
Real  L5    0.3675  0.3570  0.3611    0.1021  0.0994  0.1004    0.1526  0.1478  0.1496


Table 1: ROUGE scores for the baseline and our approach under
configurations at different levels. Here, “GT” and “Real” respectively
refer to the results obtained based on annotated objects and objects
detected by the visual parsing method. For each ROUGE metric, we
report the recall (R), precision (P), and F-score (F) averaged over
all scenes and 10 randomized runs.


6.2 Performance Metrics 

To evaluate our method, we look to metrics typically used in machine
translation. These include the BLEU (Papineni et al., 2002) and ROUGE
metrics, among others. The BLEU score measures precision on n-grams,
and is thus less suitable for our goal of lingual image description,
as already noted in (Mitchell et al., 2012; Das et al., 2013). On the
other hand, ROUGE is an n-gram recall-oriented measure that evaluates
the information coverage between summaries produced by human
annotators and those automatically produced by systems. ROUGE-1
(unigram) recall is the best option to use for comparing descriptions
based only on predicted keywords (Das et al., 2013). ROUGE-2 (bigram)
and ROUGE-SU4 (skip-4 bigram) are best to evaluate summaries with
respect to coherence and fluency. We use the ROUGE metrics following
(Das et al., 2013), who use them to evaluate lingual video
summarization.
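
To make the recall, precision, and F-score columns concrete, the sketch below computes a simplified ROUGE-N between one candidate and one reference. It is an illustration under stated assumptions (lowercased whitespace tokenization, a single reference, no stemming, and no ROUGE-SU4), and it is not the official ROUGE toolkit used to produce the reported numbers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, F) of clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    total = recall + precision
    f_score = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_score


r, p, f = rouge_n("a red box is on a table",
                  "a red box is on top of a table", n=1)
```

For this pair, 7 of the 9 reference unigrams are matched, so recall is 7/9 while precision is 1.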

6.3 Comparison of Results 

The proposed text generation method has five optional switches,
controlling whether the following features are used during generation:
(1) diversity: encourage diversity of the sentences by suppressing the
entities and relations that have been mentioned; (2) saliency: draw
salient objects with higher probability; (3) scene: the leading
sentence mentions the class of the scene; (4) attributes: use colors
and sizes to describe objects when they are available; (5)
coreference: use a pronoun to refer to an object when it is mentioned
in the previous sentence. Our experiments test the framework under six
feature levels, level-0 to level-5, where the level-k configuration
uses the first k features when generating the sentences. In
particular, level-0 uses none of the features above, and thus each
sentence is generated independently using the grammar, while level-5
uses all of these features.
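
The level scheme amounts to switching on a prefix of the feature list. A small sketch follows; the function name `level_config` is hypothetical, and the feature names are those listed above.

```python
FEATURES = ("diversity", "saliency", "scene", "attributes", "coreference")


def level_config(k):
    """Return the on/off switches for the level-k configuration."""
    if not 0 <= k <= len(FEATURES):
        raise ValueError(f"level must be between 0 and {len(FEATURES)}")
    # The first k features are enabled, the remaining ones are disabled.
    return {name: index < k for index, name in enumerate(FEATURES)}
```

For example, `level_config(3)` enables diversity, saliency, and scene, and leaves attributes and coreference off.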

To put our performance in perspective, we compare our method to an
intelligent baseline which follows a conventional approach in
description generation. The baseline describes an image by retrieving
the visually most similar image from the training set and simply using
its corresponding description. To compute our baseline, we use a
battery of visual features such as spatial pyramids of SIFT, HOG, LBP,
geometric context, etc., and kernels with different distances. We use
(Xiao et al., 2010) to compute the kernels. Based on a combined
kernel, we simply retrieve the training image with the highest
matching score.
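
The retrieval step of the baseline can be sketched as below. The text does not say how the individual kernels are combined, so a weighted sum with uniform default weights is assumed here; the function name and input layout are likewise hypothetical.

```python
def retrieve_description(kernel_scores, train_descriptions, weights=None):
    """Return (index, description) of the best-matching training image.

    kernel_scores[f][t] is the similarity between the test image and
    training image t under feature kernel f.
    """
    num_kernels = len(kernel_scores)
    if weights is None:
        # Assumption: kernels are averaged with equal weight.
        weights = [1.0 / num_kernels] * num_kernels
    combined = [
        sum(w * scores[t] for w, scores in zip(weights, kernel_scores))
        for t in range(len(train_descriptions))
    ]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, train_descriptions[best]
```

The retrieved description is then used verbatim as the baseline output for the test image.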

Table 1 shows the results. We evaluate two settings: using
ground-truth objects (denoted with GT) and using the results obtained
via the visual parser (denoted with Real). We can see that the
proposed method significantly outperforms the baseline in all three
ROUGE measures. Also, configurations at level 3 and above are clearly
better than levels 1 and 2, which indicates that a special leading
sentence that gives an overview of the scene is important for
description generation. In addition, we observe that there is no
noticeable improvement from level 3 to levels 4 and 5. This is not
surprising: whereas attributes and coreference improve the quality of
descriptions by making them richer and less verbose, such improvement
in quality does not contribute substantially to the ROUGE scores,
which are based on n-gram comparisons.

Figure 4 shows descriptions generated using our approach on a diverse
set of scenes. It can be seen that linguistic issues such as sentence
diversity, using attributes to describe objects, and using pronouns
for coreferences have been properly addressed. However, there remain
some problems that need future efforts to address. For example, since
the choices of templates for different sentences are independent,
sometimes an unfortunate selection of a template sequence may make the
paragraph slightly unnatural.

Figure 4: Several examples of the descriptions generated using the
proposed framework. In the top two rows the method builds on the
ground-truth cuboids, while the bottom row shows the results using the
visual parser. Note that in the case of GT, the input to the method is
the full set of GT objects for the image, thus the method still needs
to take into account the saliency of what to talk about. We color-code
object cuboids and nouns referring to them in text.

7 Conclusion 

We presented a new framework for generating natural descriptions of
indoor scenes. Our framework integrates a CRF model for visual
parsing, a generative grammar automatically learned from training
descriptions, as well as a transformation algorithm to derive semantic
trees from scene graphs, which takes into account the dependencies
across sentences. Our experiments show substantially better
descriptions than those produced by a baseline. Such findings indicate
that high quality description generation requires not only reliable
image understanding, but also delicate attention to linguistic issues,
such as diversity, coherence, and logical order of sentences.